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Abstract. We present parallel and sequential dense QR factorization algorithms that are both 
optimal (up to polylogarithmic factors) in the amount of communication they perform, and just as 
stable as Householder QR. 

We prove optimality by extending known lower bounds on communication bandwidth for se- 
quential and parallel matrix multiplication to provide latency lower bounds, and show these bounds 
apply to the LU and QR decompositions. We not only show that our QR algorithms attain these 
lower bounds (up to polylogarithmic factors), but that existing LAPACK and ScaLAPACK algo- 
rithms perform asymptotically more communication. We also point out recent LU algorithms in the 
literature that attain at least some of these lower bounds. 

1. Introduction. The large and increasing costs of communication motivate 
redesigning algorithms to avoid it whenever possible. In the parallel case, communi- 
cation refers to messages between processors, which may be sent over a network or 
via a shared memory. In the sequential case, communication refers to data movement 
between different levels of the memory hierarchy. In both the parallel and sequential 
cases we model the time to communicate a message of n words as a -I- /3n, where a 
is the latency and (3 is the reciprocal bandwidth. Many authors have pointed out 
technology trends causing floating point to become faster at an exponentially higher 
rate than bandwidth, and bandwidth at an exponentially higher rate than latency 
(see e.g., Graham et al. [23]). 

We present parallel and sequential dense QR factorization algorithms that are 
both optimal (sometimes only up to polylogarithmic factors) in the amount of com- 
munication (latency and bandwidth) they require, and just as numerically stable as 
conventional Householder QR. Some of the algorithms are novel, and some extend 
earlier work. The first set of algorithms, "Tall Skinny QR" (TSQR), are for matrices 
with many more rows than columns, and the second set, "Communication- A voiding 
QR" (CAQR), are for general rectangular matrices. The algorithms have significantly 
lower latency cost in the parallel case, and significantly lower latency and bandwidth 
costs in the sequential case, than existing algorithms in LAPACK and ScaLAPACK. 

It will be easy to see that our parallel and sequential TSQR implementations 
communicate as little as possible. To prove optimality of CAQR, we extend known 
lower bounds on communication bandwidth for sequential and parallel versions of 
conventional 0(?t.^) matrix multiplication (see Hong and Kung [28 and Irony, Toledo, 
and Tiskin [27]) to also provide latency lower bounds, and show that these bounds 
also apply to Q{n'^) implementations of dense LU and QR decompositions. Showing 
that the bounds apply to LU is easy, but QR is more subtle. We show that CAQR 
attains these lower bounds (sometimes only up to polylogarithmic factors). 

Implementations of TSQR and CAQR demonstrating significant speedups over 
LAPACK and ScaLAPACK will be presented in other work [T7]; here we concentrate 
on proving optimality. 

Tables [l~T}jl.6| summarize our performance models and lower bounds for TSQR, 
CAQR, and LAPACK's sequential and ScaLAPACK's parallel QR factorizations. Our 
model of computation looks the same for the parallel and sequential cases, with run- 
ning time = #fiops x time_per_flop -f- #words_moved x (1/bandwidth) -I- ^messages 
X latency, where the last two terms constitute the communication. We do not model 
overlap of communication and computation, which while important in practice can at 
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most improve the running time by a factor of 2, whereas we are looking for asymptotic 
improvements. In the tables we give the #flops, #words moved and ^messages as 
functions of the number of rows m and columns n (assuming m > n), the number 
of processors P in the parallel case, and the size of fast memory W in the sequential 
case. To make these tables easier to read, we omit most lower order terms, make 
boldface the terms where the new algorithms differ significantly from Sca/LAPACK, 
and make the optimal choice of matrix layout for each parallel algorithm: This means 
optimally choosing the block size b as well as the processor grid dimensions Pr x Pc 
in the 2-D block cyclic layout. (See Section [s] for discussion of these parameters, and 
detailed performance models for general layouts.) 

Tables |1.1H1.3| present the parallel performance models for TSQR, CAQR on 
general rectangular matrices, and CAQR on square matrices, respectively. First, 
Table |1.1| shows that parallel TSQR requires only logP messages, which is both 
optimal and a fact or 2n fewer messages than ScaLAPACK's parallel QR factorization 



PDGEQRF. Table 1.2 shows that parallel CAQR needs only Q{^ynP/m) messages 
(ignoring polylogarithmic factors) on a general m x n rectangular matrix, which is 
both optimal and a factor Q^y^mn/ P) fewer messages than ScaLAPACK. Note that 
y/mn/P is the square root of each processor's local memory size, up to a small 
constant factor. Table 1.3 presents the same comparison for the special case of a 
square n x n matrix. 

Next, Tables pTlf) 1 . 6 [ present the sequential performance models for TSQR, CAQR 
on general rectangular matrices, and CAQR on square matrices, respectively. Ta- 
ble [L4] compares sequential TSQR with sequential blocked Householder QR. This is 
LAPACK's QR factorization routine DGEQRF when fast memory is cache and slow 
memory is DRAM, and models ScaLAPACK's out-of-DRAM QR factorization routine 
PFDGEQRF when fast memory is DRAM and slow memory is disk. Sequential TSQR 
transfers fewer words between slow and fast memory: 2mn, which is both optimal and 
a factor mn/{AW) fewer words than transferred by blocked Householder QR. Note 
that mn/W is how many times larger the matrix is than the fast memory size W. 
Furthermore, TSQR requires fewer messages: at most about 3mn/W, which is close 
to optimal and 0(n) times lower than Householder QR. Table [TTs] compares sequential 
CAQR and sequential blocked Householder QR on a general rectangular matrix. Se- 
quential CAQR transfers fewer words between slow and fast memory: Q{mn'^ /^/W), 
which is both optimal and a factor Q{m/\/W) fewer words transferred than blocked 
Householder QR. Note that m/\/W = ^/mPjW is the square root of how many times 
larger a square mxm matrix is than the fast memory size W . Sequential CAQR also 
requires fewer messages: \2ran'^ /W'^^'^ , which is optimal. We note that our analysis 
of CAQR applies for any W , whereas our analysis of the algorithms in LAPACK and 
ScaLAPACK assume that at least 2 columns fit in fast memory, that is VF > 2m; 
otherwise they may communicate even more. Finally, Table 1.6 presents the same 
comparison for the special case of a square n x n matrix. 
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Table 1.1 

Performance models of parallel TSQR and ScaLAPACK's parallel QR factorization 
PDGEQRF on an m X n matrix with P processors, along with lower bounds on the number of 
flops, words, and messages. We assume m/P > n. 
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Table 1.2 



Performance models of parallel CAQR and ScaLAPACK's parallel QR factorization 
PDGEQRF on a m X n matrix with P processors, along with lower hounds on the number of 
flops, words, and messages. The matrix is stored in a 2-D Pr x Pc block cyclic layout with square 
b X b blocks. We choose b, Pr, and Pc optimally and independently for each algorithm. 
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Table 1.3 



Performance models of parallel CAQR and ScaLAPACK's parallel QR factorization 
PDGEQRF on a square n X n matrix with P processors, along with lower bounds on the num- 
ber of flops, words, and messages. The matrix is stored in a 2-D Pr X Pc block cyclic layout with 
square bxb blocks. We choose b, Pr, and Pc optimally and independently for each algorithm. 
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Table 1.4 



Performance models of sequential TSQR and blocked sequential Householder QR ( either LA- 
PACK's in-DRAM DGEQRF or ScaLAPACK's out-of-DRAM PFDGEQRF) on anmxn matrix 
with fast memory size W , along with lower bounds on the number of flops, words, and messages. 
We assume m'3> n and W > 3n^/2. W = W — n(n + l)/2, which is at least about ^W. 
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Table 1.5 

Performance models of sequential CAQR and blocked sequential Householder QR (either LA- 
PACK's in-DRAM DGEQRF or ScaLAPACK's out-of-DRAM PFDGEQRF) on anmxn matrix 
with fast memory size W, along with lower bounds on the number of flops, words, and messages. 
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Table 1.6 



Performance models of sequential CAQR and blocked sequential Householder QR (either LA- 
PACK's in-DRAM DGEQRF or ScaLAPACK's out-of-DRAM PFDGEQRF) on a square nxn 
matrix with fast memory size W , along with lower bounds on the number of flops, words, and 
messages. 



Finally, we note that although our new algorithms perform slightly more floating 
point operations than LAPACK and ScaLAPACK, they have the same highest order 
terms in their floating point operation counts. (For TSQR, which is intended for 
the case m ^ n, only the term containing m is highest order.) In fact we prove a 
matching lower bound on the amount of arithmetic, assuming we avoid "Strassen-like" 
algorithms in a way made formal later. 

Now we briefly describe related work and our contributions. The tree-based QR 
idea itself is not novel (see for example, [i El [H HU [531 [HI ISHl iOl SI]) , but we have 
a number of optimizations and generalizations: 

• Our algorithm can perform almost all its floating-point operations using any 
fast sequential QR factorization routine. For example, we can use blocked 
Householder transformation exploiting BLAS3, or invoke Elmroth and Gus- 
tavson's recursive QR (see [T8| IT9]). 

• We use TSQR as a building block for CAQR, for the parallel resp. sequential 
factorization of arbitrary rectangular matrices in a two-dimensional block 
cyclic layout. 

4 



• Most significantly, we prove optimality for both our parallel and sequential 
algorithms, with a 1-D layout for TSQR and 2-D block layout for CAQR, i.e., 
that they minimize bandwidth and latency costs. This assumes Q{n^) (non- 
Strassen-like) algorithms, and is usually shown in a Big-Oh sense, sometimes 
modulo poly logarithmic terms. 

• We describe special cases in which existing sequential algorithms by Elm- 
roth and Gustavson [19J and also LAPACK's DGEQRF attain minimum 
bandwidth. In particular, with the correct choice of block size, Elmroth's 
and Gustavson's RGEQRF algorithm attains minimum bandwidth and flop 
count, though not minimum latency. 

• We observe that there are alternative LU algorithms in the literature that 
attain at least some of these communication lower bounds: |23 describes a 
parallel LU algorithm attaining both bandwidth and latency lower bounds, 
and 137] describes a sequential LU algorithm that at least attains the band- 
width lower bound. 

• We outline how to extend both algorithms and optimality results to certain 
kinds of hierarchical architectures, either with multiple levels of memory hi- 
erarchy, or multiple levels of parallelism (e.g., where each node in a parallel 
machine consists of other parallel machines, such as multicore). In the case 
of TSQR we do this by adapting it to work on general reduction trees. 

It is possible to do a stable QR factorization (or indeed most any dense linear 
algebra operation) at the same asymptotic speed as matrix multiplication (e.g., in 
Q{n}°^^'^) operations using Strassen) [T5] and so with less communication as well, but 
we do not discuss these algorithms in this paper. 

We note that the Q factor will be represented as a tree of smaller Q factors, which 
differs from the traditional layout. Many previous authors did not explain in detail 
how to apply a stored TSQR Q factor, quite possibly because this is not required for 
solving a single least squares problem: Adjoining the right-hand side(s) to the matrix 
A, and taking the QR factorization of the result, requires only the R factor. Previous 
authors discuss this optimization. However, many of our applications require storing 
and working with the implicit representation of the Q factor. Our performance models 
show that applying this tree-structured Q has about the same cost as the traditionally 
represented Q. 

The rest of this report is organized as follows. Section |2] presents TSQR, de- 
scribing its parallel and sequential optimizations, performance models, comparisons 
to LAPACK and ScaLAPACK, and how it can be adapted to other architectures. 
Section |3] presents CAQR analogously. (This paper is based on the technical report 
[16], to which we leave many of the detailed derivations of the performance models.) 
Section [4] presents our lower bounds for TSQR, and Section [s] for CAQR (as well 
as LU). Section [6] describes related work. Section [t] summarizes and describes open 
problems and future work. 

2. Tall-Skinny QR - TSQR. In this section, we present the TSQR algorithm 

for computing the QR factorization of an m x n matrix A, stored in a 1-D block row 
layout. We assume m > n, and typically n. (See |5j for a description of ID and 
2D layouts.) 

Subsection |2.1| describes parallel TSQR on a binary tree, sequential TSQR on 
a "flat" tree, and then TSQR as a reduction on an arbitrary tree. Subsection |2.2| 
describes performance models, and Subsection |2.3| compares TSQR to alternative 
algorithms, both stable and unstable; we will see that TSQR does asymptotically 
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less communication than the stable alternatives, and is about as fast as the fastest 
unstable alternative (CholeskyQR). 

2.1. TSQR as a reduction operation. We will describe a family of algorithms 
that takes an m-by-n matrix A — [Aq; Ai; ■ ■ ■ ; and produces the R factor of its 

QR decomposition. Here we use Matlab notation, so that the Ai are stacked atop one 
another, and we assume Ai is mi-hy-n. In later sections we will assume rrii > n, but 
that is not necessary here. 

The basic operation in our examples is to take two or more matrices stacked atop 
one another, like A = [Ao;Ai], and replace them by the R factor of A. As long as 
more than one matrix remains in the stack, the reduction continues until one R factor 
is left, which we claim is the R factor of the original A. The pattern of which pairs 
(or larger groups) of matrices are combined in one step forms what we will call a 
reduction tree. 

We write this out explicitly for TSQR performed on a binary tree starting with 
p — 4: blocks. We start by replacing each Ai by its own individual R factor: 



A- 



Proceeding with the first set of reductions, we write 
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(2.2) 



Thus [Rq; Ri] is replaced by Rqi and [R2; R3] is replaced by i?23- Here and later, the 
subscripts on a matrix like Rij refer to the original Ai and Aj on which they depend. 
The next and last reduction is 
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(2.3) 



We claim that -R0123 is the R factor of the original A — [Aq; Ai; ^3]- To see this, 
we combine equations (2.11, (2.2 1 and (2.3 1 to write 
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(2.4) 



For this product to make sense, we must choose the dimensions of the Q factors 
consistently: They can all be square, or when all > n, they can all have n columns 
(in which case each R factor will be n-by-n). (The usual representation of Q factors 
by Householder vectors encodes both possibilities.) In either case, we have expressed 
A as a product of (block diagonal) orthogonal matrices (which must therefore also 
be orthogonal), and the triangular matrix -Roi23- By uniqueness of the QR decom- 
position (modulo signs of diagonal entries of i?oi23), this is the QR decomposition of 
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A. We note that we will not multiply the various Q factors together, but leave them 



represented by the "tree of Q factors" implied by equation (2.4 1. 

We abbreviate this algorithm with the following simple notation, which makes 
the binary tree apparent: 

^0 ^ i?o \ p 

^1 - i?i / """X „ 

The notation has the following meaning: if one or more arrows point to the same 
matrix, that matrix is the R factor of the matrix obtained by stacking all the matrices 
at the other ends of the arrows atop one another. This notation not only makes the 
parallelism in the algorithm apparent (all QR decompositions at the same depth in 
the tree can potentially be done in parallel) , but implies that any tree leads to a valid 
QR decomposition. For example, conventional QR decomposition may be expressed 
as the trivial tree 

Az^ 

The tree we will use for sequential TSQR with limited fast memory W is the 
following so-called "flat tree" : 

^0 '"Ra ~z^Ror~^Roi2~^Roi23 
Av 
Ar 

^3 

The idea of sequential TSQR is that if fast memory can only hold a little more 
than a fraction mj-p of the rows of A (a little more than m/4 for the above tree), 
then the algorithm proceeds by reading in the first m/p rows of A, doing its QR 
decomposition, keeping i?o in fast memory but writing the representation of Qo back 
to slow memory, and then repeatedly reading in the next m/p rows, doing the QR 
decomposition of them stacked below the R factor already in memory, and writing 
out the representation of the new Q factor. This way the entire matrix is read into 
fast memory once, and the representation of all the Q factors is written out to fast 
memory once, which is clearly the minimal amount of data movement possible. 

For an example of yet another TSQR reduction tree more suitable for a hybrid 
parallel / out-of-core factorization, see [HI Section 4.3]. 

It is evident that all these variants of TSQR are numerically stable, since they 
just involve repeated applications of orthogonal transformations. Note also that the 
local QR factorizations in both the parallel and sequential TSQR algorithms can 
avoid storing and performing arithmetic with zeros in the triangular factors. This 
optimization still allows the use of high-performance QR algorithms (such as the 
BLAS 3 YTY^ representation of Schreiber and Van Loan [44 and the recursive QR 
factorization of Elmroth and Gustavson [19 ) for the local computations. For details, 
see Demmel et al. [I6| Section 7]. 

We close this subsection by observing that the general theory of reduction op- 
erations applied to associative operators (e.g., optimizing the shape of the reduction 
tree |36j . or how to compute prefix sums of ai ★ a2 * • ■ ■ Cp where -k could be scalar 
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addition, matrix multiplication, etc.) applies to QR decomposition as well, because 
the mapping from [Ag;/!!] to its R factor is associative (modulo roundoff and the 
choice of the signs of the diagonal entries). 

2.2. Performance models for TSQR. We present performance models for 
parallel and sequential TSQR. We outline their derivations, which are straightforward 
based on the previous descriptions, and leave details to [111 Section 8]. In the next 
section we will compare the models for TSQR with alternative algorithms. 

The runtimes will be functions of m and n. In the parallel case, the runtime will 
also depend on the number of processors P, where we assume each processor stores 
m/P rows of the input matrix A. (It is easiest to think of the rows as contiguous, 
but if they are not, we simply get the QR decomposition of a row-permutation of 
A, which is still just the QR decomposition). In the sequential case the runtime will 
depend on W, the size of fast memory. We assume fast memory is large enough to 

contain at least n rows of A, and an R factor, i.e. W w In both parallel and 

sequential cases, we let 7 = time per flop, /3 — reciprocal bandwidth (time per word) 
and a — latency (time per message). We assume no overlap of communication and 
computation (as said before, this could speed up the algorithm at most 2x). All 
logarithms are in base 2. 

A parallel TSQR factorization on a binary reduction tree performs the following 
computations along the critical path: one local QR factorization of a fully dense 
m/P X n matrix, and log P factorizations, each of a 2n x n matrix consisting of two 
n X n upper triangular matrices. The factorization requires — h ^f- log P flops 
(ignoring lower order terms here and elsewhere) and logP messages, and transfers a 
total of logP words between processors. Thus, the total run time is 

Timepar. TSQR(m, n, P) = f — ^ + log P j 7 + ( 2*^^ log P j /5 + (log P) a . 

(2.5) 

Now we consider sequential TSQR. To flrst order, TSQR performs the same num- 
ber of floating point operations as standard Householder QR, namely 2mn^ — 2n^/3. 
As described before, sequential TSQR moves 2mn words by dividing A into subma- 
trices that are as large as possible, i.e., m' rows each such that m' ■ n + "^"^"^^-^ < W, 
or m' « {W ~ "("^+1) '^^jji — W/n, where W — W ~ liil^hii^ Assuming A is stored so 
that groups of m' rows are in contiguous memory locations, the number of messages 
sequential TSQR needs to send is = . Thus the runtime for sequential TSQR 
is 

Timescq. tsqr(to, ?^, W) = \ 2mn^ - — \ 7 -I- (2mn) (3 + ( ^ ) " ■ (2-6) 

We note that W « 2W/3, so that the number of messages 2mn/W « 3mn/W. 

2.3. Comparison of TSQR to alternative algorithms. We compare paral- 
lel and sequential QR to alternative algorithms, both stable and unstable: Classical 
Gram-Schmidt (CGS), Modified Gram-Schmidt (MGS), Cholesky QR, and House- 
holder QR, as implemented in LAPACK and ScaLAPACK; only the latter are nu- 
merically stable in all cases. In summary, TSQR not only has the lowest complexity 
(comparing highest order terms), but has asymptotically lower communication com- 
plexity than the only numerically stable alternatives. We outline our approach and 
leave details of counting to [16( Section 9]. 
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Table 2.1 

Performance models of various parallel QR algorithms for "tall-skinny" matrices, i.e. with 
n. We show only the best-performing versions of MGS (right-looking) and CGS (left-looking). 
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Table 2.2 

Performance models of various sequential QR algorithms for "tall-skinny" matrices, i.e. with 
m ^ n. PFDGEQRF is our model of ScaLAPACK's out-of-DRAM QR factorization; W is the 
fast memory size, and W = W — n{n + l)/2. Lower-order terms omitted. 



MGS and CGS can be either right-looking or left-looking. For CGS either alter- 
native has the same communication complexity, but for MGS the right- looking variant 
has much less latency, so we present its performance model. 

Cholesky QR forms A^A, computes its upper triangular Cholesky factor R, and 
forms Q — AR^^. It can obviously be unstable, but is frequently used when A is 
expected to be well-conditioned (see section |6]) . 

We need to say a little more about sequential Householder QR. LAPACK's right- 
looking DGEQRF repeatedly sweeps over the entire matrix, potentially leading to 
proportionally as much memory traffic as there are floating point operations, a factor 
Q{n) more than sequential TSQR; a left- looking version of DGEQRF would be similar. 
To make a fairer comparison, we model the performance of a left-looking QR algorithm 
that was optimized to minimize memory movement in an out-of-DRAM environment, 
i.e., where fast memory is DRAM and slow memory is disk. This routine, PFDGEQRF 
[IB] was designed to combine ScaLAPACK's parallelism with minimal disk accesses. 
As originally formulated, it uses ScaLAPACK's parallel QR factorization PDGEQRF 
to perform the current panel factorization in DRAM, but we assume here that it 
is running sequentially since we are only interested in modeling the traffic between 
slow and fast memory. PFDGEQRF is a left-looking method, as usual with out-of- 
DRAM algorithms (left-looking schemes do fewer writes than right-looking schemes, 
since writes are often more expensive.) PFDGEQRF keeps two panels in memory: a 
left panel of fixed width b, and the current panel being factored, whose width c can 
expand to fill the available memory. Details of the algorithm and analysis may be 
found in |13j and |16l Appendix F] , where we choose b and c to minimize disk traffic; 
we summarize the performance model in Table |2.2| 



Examining Table 2.1 we see that all parallel algorithms have the same highest 
order term in their flop counts, p" , and also use the same bandwidth, ^ log P, but 
that parallel TSQR sends 2n times fewer messages than the only stable alternative 



(PDGEQRF), and is about as fast as the fastest unstable method (Cholesky QR). In 
other words, only parallel TSQR is simultaneously fastest and stable. 

Examining Table |2.2[ we see a similar story, with sequential TSQR sending about 
times fewer words and j times fewer messages than the only stable alternative, 
PFDGEQRF. Note that ^ is how many times larger the entire matrix is than fast 
memory. Since we assume W > n? , the number of words TSQR sends is less than the 
number of words Cholesky QR sends. 

3. Communication- Avoiding QR - CAQR. We present the CAQR algo- 
rithm for computing the QR factorization of an m-by-n matrix A, with m > n. In 
the parallel case A is stored on a two-dimensional grid of processors P = Pr x Pc 
in a 2-D block-cyclic layout, with blocks of dimension b x b. We assume that all the 
blocks have the same size; we can always pad the input matrix with zero rows and 
columns to ensure this is possible. In the sequential case we also assume A is stored 
in a Pr X Pc 2-D blocked layout, with individual p^-by--^ blocks stored contiguously 
in memory. For a detailed description of the 2-D block cyclic layout, see [S]. 

Stated most simply, parallel (resp. sequential) CAQR simply implements the 
right-looking QR factorization using parallel (resp. sequential) TSQR as the panel 
factorization. The rest is bookkeeping. 

Section |3.1| discusses parallel CAQR in more detail, and comparing performance 
to ScaLAPACK. We also show, given m, n and P, to choose Pr, Pc and b to minimize 
running times of both algorithms; our proof of CAQR's optimality depends on these 



choices. Section 3.2 does the same for sequential CAQR and an out-of-DRAM algo- 
rithm from ScaLAPACK, whose floating point operations are counted sequentially. 
Subsection |3.3| discusses other sequential QR algorithms, including showing that re- 
cursive QR routines of Elmroth and Gustavson [19j also minimize bandwidth, though 
possibly not latency. 

3.1. Parallel CAQR. We describe a few details most relevant to the complexity 
but refer the reader to [13 Section 13] for details. At the j-th step of the algorithm, 
parallel TSQR is used to factor the panel of dimension m — ( j — l)6-by-6, whose top 
left corner is at matrix diagonal entry {j — 1)5+ 1. We assume for simplicity that 
the rrij ~ m — (j — l)b rows are distributed across all Pr processors in the processor 
column. When we do parallel TSQR on the panel, all the at most local rows of 
the panel stored on a processor are factored together in the first step of TSQR. After 
the panel factorization, we multiply the transpose of the Q factor times the trailing 
submatrix as follows. First, the Householder vectors representing the Q factor of the 

local rows of the panel are broadcast to all the processes in the same processor 
row, and applied to their submatrices in an embarrassingly parallel fashion. Second, 
the Householder vectors Y of the smaller Q factors in TSQR's binary reduction tree 
are independently broadcast along their processor rows, and the updates to the b rows 
in each pair of processors are performed in parallel, with the triangular T factor of 
the block Householder transformation / — YTY"'" being computed by one of the two 
processors, and with the two processors exchanging only b rows of data. 

Table [3T] summarizes the operation counts, including divisions counted separately, 
as well as a similar model for ScaLAPACK's PDGEQRF for comparison. We make 
the following observations. Parallel CAQR does slightly more flops than ScaLAPACK 
(but only in lower order terms), and sends nearly the same of words (actually very 
slightly fewer). But CAQR reduces the 3nlogPr term in ScaLAPACK's message 
count by a factor of b, and so can reduce the overall message count by as much as a 
factor of b (depending Pr and Pc). Thus by increasing the block size &, we can lower 
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the number of messages by a large factor. But we can't raise b arbitrarily without 
increasing the flop count; next we show how to choose the parameters b, and 
to minimize the runtime. 





Parallel CAQR 


# messages 


f logP. + f logP, 


# words 


( + "2") logP, + (^™-"V2 + 2n] 


logPe 


# flops 


2n^{3m-n) , i,n^ , 3bn(2m~n) , /'4b^n , (3i)+5) '\ ,„„ d j,2„ 


3P ' 2Pe 2Pr ' y 3 


' 2Pe 1 r •J II- 


# divisions 


+ "2" (logP, 1) 




ScaLAPACK's PDGEQRF 


# messages 


3nlogP, + ^logPc 


# words 


["p +hn) logP,+ ('""-"'/^ + "2") 


log Pc 


# flops 


2n^{3m-n) , bn^ , 3bn(2m-n) 

3P ' 2Pc ' 2Pr 3Pr 


# divisions 


mn — n'^ /2 
P^ 



Table 3.1 



Performance models of parallel CAQR and ScaLAPACK's PDGEQRF when factoring anmxn 
matrix, m > n, distributed in a 2-D block cyclic layout on a Pr X Pc grid of processors with square 
b X b blocks. All terms are counted along the critical path. In this table exclusively, "flops" only 
includes floating-point additions and multiplications, not floating-point divisions, which are shown 
separately. Some lower-order terms are omitted. 



When choosing b, Pj., and P^ to minimize the runtime, they must satisfy the 
following conditions: 



l<Pr,Pc<P , PrPc = P , l<b< — and 1 < & < — (3.1) 

_rv Pc 

For simplicity we will assume that Pr evenly divides m and that evenly divides n. 
Example values of b, Pr, and Pc which satisfy the constraints in Equation (3.1 1 are 



Pr = 



mP 
n 



Pr = 



and 6 = 



These values are chosen simultaneously to minimize the approximate number of words 
sent, in? / Pc + mn/Pr, and the approximate number of messages, bn/b, where for 
simplicity we temporarily ignore logarithmic factors and lower-order terms in Table 



3.1[ This suggests using the following ansatz: 

1 



Pr^K 



mP 
n 



Pr^ 



K 



and b — B ■ 



(3.2) 



for general values of K and B < mm{K,l/K}, since we can thereby explore all 
possible values of b, Pr and Pc satisfying (3.1 1. 

Using the substitutions in Equation (3.2 1, the flop count (neglecting lower-order 
terms, including the division counts) becomes 



mn 



B' 



3B 
1{ 



BK\ 



n 
1^ 



3B 
2K 

' log ( X • 



P 



4B2 
~3~ 



3BK\ 



(3.3) 
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We wish to choose B and K so as to minimize the flop count. We know at least 
that we need to eliminate the dominant mn^ log(. . . ) term, so that parallel CAQR 
has the same asymptotic flop count as ScaLAPACK's PDGEQRF. This is because we 
know that CAQR performs at least as many floating-point operations (asymptotically) 
as PDGEQRF, so matching the highest-order terms will help minimize CAQR's flop 
count. 

To make the high-order terms of (3.3 1 match the 2mn^/P — 2ii? /{'dP) flop count 



of ScaLAPACK's parallel QR routine, while minimizing communication as well, we 
can pick K = \ and 





for simplicity we will use 

B^^o^-^\^j■^j (3.4) 

although B could be multiplied by some positive constant. 

The above choices of B and K make the flop count as follows, with some lower- 
order terms omitted: 

^ - 3P + Plog(^) ^^-^^ 

Thus, we can choose the block size 6 so as to match the higher-order terms of the flop 
count of ScaLAPACK's parallel QR factorization PDGEQRF. 



Using the substitutions in Equations (3.2) and (3.4 1 with K = 1, the number of 
messages becomes 

(3.6) 




Using the substitutions in Equation (3.2 1 and (3.4 1, the number of words trans 



ferred between processors on the critical path, neglecting lower-order terms, becomes 



Jiri^ , ^ 1 ^ I nP 




(3.7) 

The results of these computations are shown in Table |3.2| which also shows the 
results for ScaLAPACK, whose analogous analysis appears in [TB] Section 15], and 
the communication lower bounds, which are discussed in Section |5] 

3.2. Sequential CAQR. As stated above, sequential CAQR is just right-looking| 

QR factorization with TSQR used for the panel factorization. (In fact left-looking QR 
with TSQR has the same costs ,16., Appendix C], but we stick with the right-looking 
algorithm for simplicity.) We also assume the m-by-n matrix A is stored in a Pj- X Pf^ 
2-D blocked layout, with individual ;p^-by--^ blocks stored contiguously in memory, 
with m> n and jr > jr- 
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Parallel CAQR w/ optimal 6, Pr, Pc 


iiops 

# messages 

# words 
Optimal b 
Optimal Pr 
Optimal Pc 


2mn^ 2n^ 
P 3P 

i\/7s^log2(=^r)-log (^^V^^j 
y^filog-2 (mZ) 

/ inP 
1 nP 

Y 




PDGEQRF w/ optimal 6, P^, Pc 


# nops 

# messages 

# words 

OTlflTTlfll /) 

Optimal Pj. 
Optimal Pc 


2mn^ 2ra^ 
P . 3P 

?log(^)log(=^) + f log(^) 
y^logP-iy^log(^) 

V P ^"fe V n / 

/ mP 

/ rtP 

v 




Theoretical lower bound 


# messages 

# words 


/ nP 

V 2iim 

/ mn'^ 

V 211P 



Table 3.2 



Highest-order terms in the performance models of parallel CAQR, ScaLAPACK's PDGEQRF, 
and theoretical lower bounds for each, when factoring an m X n matrix, distributed in a 2-D block 
cyclic layout on a Pr X Pc grid of processors with square b X b blocks. All terms are counted along 
the critical path. The theoretical lower bounds assume that n > 2^^m/P, i.e., that the matrix is 
not too tall and skinny. In summary, if we choose b, Pr, and Pc independently and optimally for 
both algorithms, the two algorithms match in the number of flops and words transferred, but CAQR 
sends a factor of &{^mn/ P) messages fewer than ScaLAPACK QR. This factor is the local memory 
requirement on each processor, up to a small constant. 



For TSQR to work as analyzed we need to choose Pr and Pc large enough for 
one such jr-^y--p- block to fit in fast memory, plus a bit more. For CAQR we will 
need to choose P,- and Pc a bit larger, so that a bit more than 3 such blocks fit in 
fast memory; this is in order to perform an update on two such blocks in the trailing 

2 

matrix given Householder vectors from TSQR occupying ™'^p + words, or at most 
altogether. In other words, we need < or P > 

Leaving details to [TSJ Appendix C], we summarize the complexity analysis by 



Tseq. CAQR(m, n,P„P,) < ( ^-P{Pc - 1) 



a - 



^-mn {Pc + \n^p)j P (3.8) 

where we have ignored lower order terms, and used P,. as an upper bound on the 
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number of blocks in each panel since this only increases the run time slightly, and is 
simpler to evaluate than for the true number of blocks Pr — [{J — l)^p^J- 

Now we choose P, Pr and Pc to minimize the runtime. From the above formula 
for Tscq. caqr(w, n, Pc, -Pr), we see that the runtime is an increasing function of Pr 
and Pc, so that we would like to choose them as small as possible, within the limits 
imposed by the fast memory size P > ^^^r- So we choose P = ^^^^ (assuming here 
and elsewhere that the denominator evenly divides the numerator) . But we still need 
to choose Pr and Pc subject to Pr ■ Pc = P- 

Examining T^cq. caqr('71i "-i ^c, ^r) again, we see that if P is fixed, the runtime 
is also an increasing function of Pc, which we therefore want to minimize. But we 
are assuming jr > jr, or Pc > The optimal choice is therefore Pc = or 

Pc = \J which also means jr = js-, i.e., the blocks in the algorithm are square. 
This choice of Pr — and Pc = ^= therefore minimizes the runtime, yielding 



Tscq. CAQRim,n,W) < 12 



2mn^ - ) 7. (3.9) 



We note that the bandwidth term is proportional to , and the latency term is 
W times smaller, both of which match (to within constant factors), the lower bounds 
on bandwidth and latency to be described in Section |5] 

The results of this analysis are shown in Table |3.3[ which also shows the results 
for an out-of-DRAM algorithm PFDGEQRF from ScaLAPACK, whose internal block 
sizes h and c have been chosen to minimize disk traffic, and where we count the floating 
point operations sequentially (see 16, Appendix F]); it can also be thought of as a 
hypothetical model for an optimized left-looking version of LAPACK's DGEQRF. 



3.3. Other Bandwidth Minimizing Sequential QR Algorithms. In this 

section we describe special cases in which previous sequential QR algorithms also 
minimize bandwidth, although they do not minimize latency. In particular, we discuss 
two variants of Elmroth's and Gustavson's recursive QR (RGEQR3 and RGEQRF 
[H]), as well as LAPACK's DGEQRF. 

The fully recursive routine RGEQR3 is analogous to Toledo's fully recursive LU 
routine |17j: Both routines factor the left half of the matrix (recursively), use the 
resulting factorization of the left half to update the right half, and then factor the 
right half (recursively again). The base case consists of a single column. The output 
of RGEQR3 applied to an m-by-n matrix returns the Q factor in the form /— YTY'^ , 
where Y is the m-by-n lower triangular matrix of Householder vectors, and T is 
an n-by-n upper triangular matrix. A simple recurrence for the number of memory 
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Sequential CAQR w/ optimal Pc, -Pc 


# flops 

# messages 

# words 
Opt. P 

Opt. Pr 
Opt. Pc 


2 

o mn 

Amn/W 
2m/VW 
2n/VW 




ScaLAPACK's PFDGEQRF w/ optimal b, c 


# flops 

# words 
Opt. 6 
Opt. c 


2'mii? — '^n? 

mn^ I 2mn 
2W W 

m n mn , 3mn 3n 
2W 6W 2 4 

1 

^ m 




Theoretical lower bound 


# messages 

# words 


371"^ (m— I) 
16(^8VK3)i/2 
3«-(m-|) „^ 
16(8VK)i/2 i^*^ 



Table 3.3 

Highest-order terms in the performance models of sequential CAQR, ScaLAPACK's out-of- 
DRAM QR factorization PFDGEQRF running on one processor, and theoretical lower bounds for each, 
when factoring an m X n matrix with a fast memory capacity of W words. 



references of either RGEQR3 or Toledo's algorithm is 

P(m, f ) + P(m - f , t) + O(^) if mn > and 71 > 1 
B{m, n) = { rnn if mn < W 

m if TO > and n = 1 

2P(m, f) + 0(^) if mn>M^andn> 1 
^ \ mn if mn < W 



TO 

t2 



if m > and n = I 



Tf) n 

= 0{^) + mn (3.10) 



W 

So RGEQR3 attains our bandwidth lower bound. (The mn term must be included 
to account for the case when n < VW, since each of the mn matrix entries must be 
accessed at least once.) However, RGEQR3 does a factor greater than one times as 
many floating point operations as sequential Householder QR. 

Now we consider RGEQRF and DGEQRF, which are both right-looking algo- 
rithms and differ only in how they perform the panel factorization (by RGEQR3 and 
DGEQR2, resp.). Let b be the width of the panel in either algorithm. It is easy to see 
that a reasonable estimate of the number of memory references just for the updates 
by all the panels is the number of panels j times the minimum number of memory 

references for the average size update 6(max(TOn, ^=)), or @(ina,x{^^^ , ^^))- Thus 
we need to pick b at least about as large as VW to attain the desired lower bound 
O(^). 



Concentrating now on RGEQRF, we get from inequality (3.101 that the ^ panel 
factorizations using RGEQR3 cost at most an additional 
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■ + mb]) = 0(^= + mn) memory references, or 0{mn) if we pick b = 
VW. Thus the total number of memory references for RGEQRF with b = \/W is 
0{'^^^^7= + mn) which attains the desired lower bound. 

Next we consider LAPACK's DGEQRF. In the worst case, a panel factorization 
by DGEQR2 will incur one slow memory access per arithmetic operation, and so 

• m6^) = 0{mnb) for all panel factorizations. For the overall algorithm to be 

guaranteed to attain minimal bandwidth, we need mnb = 0(^=), or & = 0{^^). 



Since b must also be at least about vW, this means W = 0{n), or that fast memory 
size may be at most large enough to hold a few rows of the matrix, or may be much 
smaller. 

RGEQR3 does not alway minimize latency. For example, considering applying 
RGEQR3 to a single panel with n = \fW columns and m > W rows, stored in a 
block-column layout with VW-by-VW blocks stored columnwise, as above. Then a 
recurrence for the number of messages RGEQR3 requires is 



i:(m,f) + L(m-t,f)+0(^) ifn>l 



777 77 , 

O(^) = 0(m) when n=VW 
WW 



L{m, n) 



which is larger than the minimum O(^) — 0[^^) attained by sequential TSQR 
when n = VW. 

In contrast to DGEQRF, RGEQRF, and RGEQR3, CAQR minimizes flops, band- 
width and latency for all values of W. 

4. Lower Bounds for TSQR. We present communication lower bounds for 
TSQR. As we already mentioned for the sequential case, it is obviously necessary to 
read mn words from from slow to fast memory (the input) , and write mn words from 
fast to slow memory (the output), for a lower bound of 2mn words moved. Sequential 
TSQR attains this trivial lower bound. Since the size of a message is bounded by the 
size of fast memory W, it clearly requires at least messages to send this much 

data. Since TSQR sends = '^ Tj'l+i) « messages, it attains this bound to 

within a constant factor, and is very close when W ^ n'^. 

For parallel TSQR, the lower bound on latency is obviously logP, since TSQR 
needs to compute a nontrivial function of data that is spread over P processors, and a 
binary reduction tree of depth log P clearly minimizes latency (by using the butterfly 
variant). Parallel TSQR attains this lower bound too. 

Bandwidth lower bounds for parallel TSQR are more interesting. We analyze 
this in a way that applies to more general situations, starting with the following: 
Suppose processor 1 and processor 2 each own some of the arguments of a function 
/ that processor 1 wants to compute. What is the least volume of communication 
required to compute the function? We are interested in smooth functions of real or 
complex arguments, and so will use techniques from calculus rather than modeling 
the arguments as bit strings. 

In this way, we will derive necessary conditions on the function / for it to be 
evaluable by communicating fewer than all of its arguments to one processor. We will 
apply these conditions to various linear algebra operations to capture our intuition 
that it is in fact necessary to move all the arguments to one processor for correct 
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evaluation of /: Subsection |4 . 1 1 will show that if / is a bijection as a function of the n 
arguments on processor 2, and if processor 2 can only send one message to processor 



1, then it indeed has to send all n arguments (part 3 of Lemma 4.1 1. Subsection 4.2 



extends this to reduction operations where each processors sends one message to 
its parent in a reduction tree, which is the case we are considering in this paper. 
Subsection |4.3| goes a step further and asks whether less data can be sent overall by 
allowing processors 1 and 2 to exchange multiple but smaller messages; the answer is 
sometimes yes, but again not for the reduction operations we consider. 

4.1. Communication lower bounds for one-way communication between] 
2 processors. Suppose x'™^ G M™ is owned by processor 1 (PI) and y*-"^ G M" is 
owned by P2; we use superscripts to remind the reader of the dimension of each 
vector- valued variable or function. Suppose PI wants to compute /(''^(x^™^ y'"^) : 
jjm ^ T^n _^ -^r^ -y^g gj.g|. g^g]^ j-^^^ much information P2 has to send to PI, assuming 

it is allowed to send one message, consisting of n < n real numbers, which them- 
selves could be functions of y^"-*. In other words, we ask if functions h^-\y^'^'>) : 
K" m and F('')(a;('"\z(^^)) : M™ x M^^ ^ M'', exist such that /('■)(x('"), = 
i^W(x('"),/i(=Hy^"^))- When n = n, the obvious choice is to send the original data 
2/*^"^ so that /i(-^(y(")) — y'"^ is the identity function and /^''^ — F^^\ The interesting 
question is whether we can send less information, i.e. n < n. 

Unless we make further restrictions on the function h we are allowed to use, it is 
easy to see that we can always choose n — 1, i.e. send the least possible amount of 
information: We do this by using a space-filling curve |43] to represent each y'"-' G K^"^ 
by one of several preimages y G M. In other words, h'^^\y^^^) maps y*^"^ to a scalar y 
that PI can map back to y^"-* by a space filling curve. This is obviously unreasonable, 
since it implies we could try to losslessly compress n 64-bit fioating point numbers into 
one 64-bit fioating point number. However, by placing some reasonable smoothness 
restrictions on the functions we use, since we can only hope to evaluate (piecewise) 
smooth functions in a practical way anyway, we will see that we can draw useful 
conclusions about practical computations. To state our results, we use the notation 
Jxf{x,y) to denote the r x m Jacobian matrix of /^'"'' with respect to the arguments 
^.(m)^ Using the above notation, we state 

Lemma 4.1. Suppose it is possible to compute /^''^(a;*-'"^ y^"^) on PI by com- 
municating n<n words /i'--'(y("^) from P2 to PI, and evaluating /'^'"'(x'™-', y*-"-*) = 
F^'^^{x^"^\h^-\y^"'^)). Suppose /i^-' and F^"^") are continuously differ entiable on open 
sets. Then necessary conditions for this to be possible are as follows. 

1. Given any fixed y^"-' in the open set, then for all x^"^'' in the open set, the 
rows of Jyf{x,y) must lie in a fixed subspace of R" of dimension at most 
n < n. 

2. Given any fixed y^-^ G M- satisfying y^-^ — h'^-\y^^^) for some y'"-* in the 
interior of the open set, there is a set C C M" containing y^"\ of dimension 
at least n — n, such that for each x, f{x, y) is constant for y ^ C . 

3. If r — n, and for each fixed x, (x, y*-"-*) is a bijection, then it is necessary 
and sufficient to send n words from P2 to PI to evaluate f . 

Proof. Part 1 is proved simply by differentiating, using the chain rule, and noting 
the dimensions of the Jacobians being multiplied: 

implying that for all x, each row of Jy'^^™'/'''^ (x, y) lies in the space spanned by the 
n rows of Jy-^"''/i^-'(y). 
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of 



Then P2 has to communicate at least c(c+ l)/2 words to PI, and it is 



Part 2 is a consequence of the implicit function theorem. Part 3 follows from part 
2, since if the function is a bijection, then there is no set C along which / is constant. 

□ 

Either part of the lemma can be used to derive lower bounds on the volume of 
communication needed to compute /(x, y), for example by choosing an n equal to the 
lower bound minus 1, and confirming that either necessary condition in the Lemma 
is violated, at least in some open set. 

We illustrate this for a simple matrix factorization problem. 
Corollary 4.2. Suppose PI owns the r\xc matrix Ai, and P2 owns the r2 x c 
matrix Ai, with r-i > c. Suppose PI wants to compute the cx c Cholesky factor R 
of ■ R = Af ■ A\ + ■ A2, or equivalently the R factor in the QR decomposition 
Ai 
A2 

possible to communicate this few, namely either the entries on and above the diagonal 
of the symmetric cx c matrix ' ^2, or the entries of its Cholesky factor R, so that 
R^ ■ R = A2 ■ A2 (equivalently, the R factor of the QR factorization of A2). 

Proof. That it is sufficient to communicate the c(c + 1) /2 entries described above 
is evident. We use Corollary 1 to prove that these many words are necessary. We use 
the fact that mapping between the entries on and above the diagonal of the symmet- 
ric positive definite matrix and its Cholesky factor is a bijection (assuming positive 
diagonal entries of the Cholesky factor). To see that for any fixed Ai, f{Ai,R) = 
the Cholesky factor of A{ ■ Ai + R^ • i? is a bijection, note that it is a composition 
of three bijections: the mapping from R to the entries on and above the diagonal of 
Y = A2 ■ A2, the entries on and above the diagonal of Y and those on and above the 
diagonal of X = ■ Ai + Y, and the mapping between the entries on and above the 
diagonal of X and its Cholesky factor f{Ai,R). □ 

4.2. Reduction operations. We can extend this result slightly to make it apply 
to the case of more general reduction operations, where one processor PI is trying to 
compute a function of data initially stored on multiple other processors P2 through Ps. 
Wc suppose that there is a tree of messages leading from these processors eventually 
reaching PI. Suppose each Pi only sends data up the tree, so that the communication 
pattern forms a DAG (directed acylic graph) with all paths ending at PI. Let Pz's 
data be denoted j/^"^ Let all the variables on PI be denoted and treat all the 

other variables on the other processors as constants. Then exactly the same analysis 
as above applies, and we can conclude that every message along the unique path from 
Pi to PI has the same lower bound on its size, as determined by Lemma 1. This 
means Corollary 1 extends to include reduction operations where each operation is a 
bijection between one input (the other being fixed) and the output. In particular, it 
applies to TSQR. 

We emphasize again that using a real number model to draw conclusions about 
finite precision computations must be done with care. For example, a bijective func- 
tion depending on many variables could hypothetically round to the same floating 
point output for all floating point inputs, eliminating the need for any communica- 
tion or computation for its evaluation. But this is not the case for the functions we 
are interested in. 

Finally, we note that the counting must be done slightly differently for the QR 
decomposition of complex data, because the diagonal entries Ri j are generally taken 
to be real. Alternatively, there is a degree of freedom in choosing each row of R, which 
can be multiplied by an arbitrary complex number of absolute value 1. 
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4.3. Extensions to two-way communication. While the result of the pre- 
vious subsection is adequate for the results of this paper, we note that it may be 
extended as follows. For motivation, suppose that PI owns the scalar x, and wants 
to evaluate the polynomial X]"=i yi^^~^j where P2 owns the vector y^"\ The above 
results can be used to show that P2 needs to send n words to PI (all the coefficients 
of the polynomial, for example) . But there is an obvious way to communicate just 2 
words: (1) PI sends x to P2, (2) P2 evaluates the polynomial, and (3) P2 sends the 
value of the polynomial back to PI. 

More generally, one can imagine k phases, during each of which PI sends one 
message to P2 and then P2 sends one message to PI. The contents of each message 
can be any smooth functions of all the data available to the sending processor, either 
originally or from prior messages. At the end of the fc-th phase, PI then computes 
f{x,y)- 

More specifically, the computation and communication proceeds as follows: 

• In Phase 1, PI sends g["^'\xS"'^) to P2 

• In Phase 1, P2 sends h["'\j/'^\ g["''\x'^"^^)) to PI 

• In Phase 2, PI sends '^'(x^"), to P2 

• In Phase 2, P2 sends h^^'\y^''\ g["''\x^"'^) , gi"''\x^'^\ h^{''\y^''\ g['^'\x^"'^)))) 
to PI 

• ... 

• In Phase k, PI sends g'j^''\x^'^\h'^'\. . .),h^^'\. ..),... ,h^^_Y\. . .)) to 
P2 

• In Phase k, P2 sends • ■),92"'\- ■■),■■ • • ■ )) to PI 

• PI computes 

Lemma 4.3. Suppose it is possible to compute /('''(.t^"'^ r/*^"^) on PI by the 
scheme described above. Suppose all the functions involved are continuously differen- 
tiable on open sets. Let n = J2i=i ^'^^ ^ — Si=i "^i- Then necessary conditions 
for this to be possible are as follows. 

1. Suppose n < n and m < m, ie. P2 cannot communicate all its information 
to PI, but PI can potentially send its information to P2. Then there is a 
set Cx C of dimension at least m — m and a set Cy C K" of dimension 
at least n — n such that for {x,y) G C = x Cy, the value of f{x,y) is 
independent of y. 

2. If r = n = m, and for each fixed x or fixed y, f'-^\x^"^\y^^^) is a bijection, 
then it is necessary and sufficient to send n words from P2 to PI to evaluate 
/• 

Proof. We define the sets Cx and Cy by the following constraint equations, one 
for each communication step in the algorithm: 

• ffi™^^ = is a fixed constant, placing mi smooth constraints on 

x^™\ 

• In addition to the previous constraint, /i^"^^ = /iJ"^^(t/("\ gi^^\x^'^^)) is a 
fixed constant, placing m smooth constraints on y^"'\ 
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• In addition to the previous constraints, 

= 5("^)(a;("),/iJ"^^(y("),(?i"^^(a;('")))) is a fixed constant, placing 
more smooth constraints on x'™^ . 

• In addition to the previous constraints, 

constant, placing n2 more smooth constraints on 

• ... 

• In addition to the previous constraints, 
~g(rn.) ^ ),..., is a fixed con- 

stant, placing lUk more smooth constraints on x^™). 

• In addition to the previous constraints, 

^ . . ..),.. . . . )) is a fixed constant, 

placing Hk more smooth constraints on y^"^. 
Altogether, we have placed n — X]i=i ni < n smooth constraints on y*-"^ and m = 
X]i=i — smooth constraints on x'™', which by the implicit function theorem 
define surfaces Cy{h^i^^\ . . . , and Cx{g^^\ ■ ■ ■ of dimensions at least 

n — n> and m — m > 0, respectively, and parameterized by . . . , /j^"*"'} and 

{g["^^ \ ■ . . respectively. For a; S C^, and y G Cy, the values communicated 

by PI and P2 are therefore constant. Therefore, for x G Cx and y E Cy, f{x,y) = 
F{x, hi, ... , hk) depends only on x, not on y. This completes the first part of the 
proof. 

For the second part, we know that if f(x,y) is a bijection in y for each fixed 
X, then by the first part we cannot have n < n, because otherwise f{x,y) does not 
depend on y for certain values of x, violating bijectivity. But if we can send n = n 
words from P2 to PI, then it is clearly possible to compute f{x, y) by simply sending 
every component of j/*^") from P2 to PI explicitly. □ 

Corollary 4.4. Suppose PI owns the c-by-c upper triangular matrix Ri, 
and P2 owns the c-by-c upper triangular matrix R2, and PI wants to compute the 

' Ri ' 
R2 

communicate c(c + l)/2 words from P2 to PI (in particular, the entries of R2 are 
sufficient). 

We leave extensions to general communication patterns among multiple processors 
to the reader. 

5. Lower Bounds for CAQR. In this section, we review known lower bounds 
on communication bandwidth for parallel and sequential Q(n^) matrix-matrix multi- 
plication of matrices stored in 2-D layouts, extend some of them to the rectangular 
case, and then extend them to LU and QR, showing that our sequential and paral- 
lel CAQR algorithms have optimal communication complexity with respect to both 
bandwidth (in a Big-Oh sense, and sometimes modulo polylogarithmic factors). 

We will also use the simple fact that if i? is a lower bound on the number of words 
that must be communicated to implement an algorithm, and if W is the size of the 
local memory (in the parallel case) or fast memory (in the sequential case), so that W 
is the largest possible size of a message, then B/W is a lower bound on the latency, 
i.e. the number of messages needed to move B words into or out of the memory. We 
use this to derive lower bounds on latency, which are also attained by our algorithms 
(again in a Big-Oh sense, and sometimes modulo polylogarithmic factors). 



R factor in the QR decomposition of 



Then it is necessary and sufficient to 



We begin in section 5.1 by reviewing known communication complexity bounds 
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for Q{n^) matrix multiplication, due first to Hong and Kung [28 in the sequential 
case, and later proved more simply and extended to the parallel case by Irony, Toledo 
and Tiskin p7] . 

It is easy to extend lower bounds for matrix multiplication to lower bounds for 
LU decomposition via the following reduction of matrix multiplication to LU: 

(5.1) 

See [23] for an implementation of parallel LU that attains these bounds. See RE7| for 
an implementation of sequential LU and a proof that it attains the bandwidth lower 
bound (whether the latency lower bound is attained is an open problem). 

It is reasonable to expect that lower bounds for matrix multiplication will also 
apply (at least in a Big-Oh sense) to other one-sided factorizations, such as QR. As 
we will see, QR is not as simple as LU. 

All this assumes commutative and associative reorderings of conventional Q{n^) 
matrix multiplication, and so excludes faster algorithms using distributivity or special 
constants, such as those of Strassen |46] or Coppersmith and Winograd [9], and their 
use in asymptotically fast versions of LU and QR [15 . Extending communication 
lower bounds to these asymptotically faster algorithms is an open problem. 

5.1. Matrix Multiplication Lower Bounds. We review lower bounds in [5H1 
[27] for multiplication of two n-hy-n matrices C — A ■ B using commutative and 
associative (but not distributive) reorderings of the usual Q{n^) algorithm. In the 
sequential case, they assume that A and B initially reside in slow memory, that there 
is a fast memory of size W < n? , and that the product C = A - B must be computed 
and eventually reside in slow memory. They bound from below the number of words 
that need to be moved between slow memory and fast memory to perform this task: 

# words moved > — = ■ W ^ — ^ . (5.2) 

Since only W words can be moved in one message, this also provides a lower bound 
on the number of messages: 

Tl Tl 

# messages > — — ■ 1 ~ — — — . (5.3) 

In the rectangular case, where A is n-by-r, B is r-by-m, and C is n-by-m, so that the 
number of arithmetic operations in the standard algorithm is 2mnr, the above two 
results still apply, but with replaced by mnr. 

The parallel case is considered in [27' . There is actually a spectrum of algorithms, 
from the so-called 2D case, that use little extra memory beyond that needed to store 
equal fractions of the matrices A, B and C (and so about 3ti^/P words for each 
of P processors, in the square case), to the 3D case, where each input matrix is 
replicated up to P^/^ times, so with each processor needing memory of size t? jP'^l'^ 
in the square case. We only consider the 2D case, which is the conventional, memory 
scalable approach. In the 2D case, with square matrices. Irony et al show that if each 
processor has /in^/P words of local memory, and P > 32/^^, then at least one of the 
processors must send or receive at least the following number of words: 

Tl 

# words sent or received > — = (5.4) 

" 4\/2(^P)i/2 ^ ' 
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and so using at least the following number of messages (assuming a maximum message 
size of n? / P): 



pl/2 

# messages > — ^ — . (5.5) 

4^(/^)3/2 ^ ' 

We wish to extend this to the case of rectangular matrices. We do this in prepa- 
ration for analyzing CAQR in the rectangular case. The proof is a simple extension 
of Thm. 4.1 in [27\. 

Theorem 5.1. Consider the conventional matrix multiplication algorithm ap- 
plied to C = A ■ B where A is n-hy-r, B is r-hy-m, and C is n-by-m, implemented 
on a P processor distributed memory parallel computer. Let n, rh and f be the sorted 
values of n, m, and r, i.e. fi > rh > r. Suppose each processor has 3hm/P words of 
local memory, so that it can fit 3 times as much as 1 / P-th of the largest of the three 
matrices. Then as long as 



r>\j—^ (5.6) 

(i.e. none of the matrices is "too rectangular") then the number of words at least one 
processor must send or receive is 



\ TiTn ■ r 

# words moved > — , (5.7) 
" V96P 



and the number of messages is 



VP ■ f 

# messages > (5.8) 
VoD4nm 



Proof. We use (5.2 1 with mnr/P substituted for n'^, since at least one processor 



does this much arithmetic, and W ~ 3nm/P words of local memory. The constants 



in inequality (5.6 1 are chosen so that the first term in (5.2 1 is at least 2W, and half 
the first term is a lower bound. □ 

It is well-known that the communication lower bound for sequential matrix mul- 
tiplication is attained by "tiling" or "blocking" the matrices into square blocks of 
dimension and for parallel matrix multiplication by Cannon's algorithm [8]. 

5.2. Lower Bounds for CAQR. Now we need to extend our analysis of matrix 
multiplication. We assume all variables are real; extensions to the complex case are 
straightforward. Suppose A — QR is m-by-n, n even, so that 

Q^-A^ (q(1 : 1 : ^))'^ . A(1 : m, I + 1 : n) = i?(l : ^, ^ -I- 1 : n) ^ i? . 

It is easy to see that Q depends only on the first ^ columns of A, and so is inde- 
pendent of A. The obstacle to directly applying existing lower bounds for matrix 
multiplication of course is that Q is not represented as an explicit matrix, and ■ A 
is not implemented by straightforward matrix multiplication. Nevertheless, we ar- 
gue that the same data dependencies as in matrix multiplication can be found inside 
many implementations of Q"^ ■ A, and that therefore the geometric ideas underlying 
the analysis in [27 still apply. Namely, there are two data structures Q and A indexed 
with pairs of subscripts (j, i) and (j, k) respectively with the following properties. 
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A stores A as well as all intermediate results which may overwrite A. 

Q represents Q, i.e., an m-by-^ orthogonal matrix. Such a matrix is a member 

of the Stiefel manifold of orthogonal matrices, and is known to require ^ — 

^(^ + 1) independent parameters to represent, with column i requiring m — i 

parameters, although a particular algorithm may represent Q using more 

data. 

The algorithm operates mathematically independently on each column of A, 
i.e., methods like that of Strassen are excluded. This means that the algo- 
rithm performs at least ^ ~ f ( f + 1) m ultiplications on each m-dimensional 



column vector of A (see subsection 5.3 for a proof), and does the same oper- 
ations on each column of A. 

• For each (i, k) indexing Ri^k, which is the component of the fc-th column A. 
of A in the direction of the i-th column Q. i of Q, it is possible to identify 
at least m — i common components of A^^k and of Qi.i such that a parameter 
associated with Qj^i is multiplied by a value stored in Aj^k- 

The last point, which says that Q"^ • A has at least the same dependencies as matrix 
multiplication, requires illustration. 

• Suppose Q is represented as a product of ^ Householder reflections with a pro- 
jection Q onto the first ^ coordinates, Q = {I—TiUiuJ) ■ ■ ■ T„/2'"n/2'"^/2)^'| 
normalized in the conventional way where the topmost nonzero entry of each 
Uj is one, and Q consists of the first n/2 columns of the n-by-n identity ma- 
trix. Then Qjj = Ui{j) is multiplied by some intermediate value of Aj^k, i-e. 

• Suppose Q is represented as a product of block Householder transformations 
(J — ZiUi) •••(/ — ZfUj)Q where Ug and Zg are m-by-6g matrices, Ug con- 
sisting of bg Householder vectors side-by-side. Again associate with the 
j-th entry of the i-th Householder vector Ui{j). 

• Recursive versions of QR [TF apply blocked Householder transformations 
organized so as to better use BLAS3, but still let us use the approach of the 
last bullet. 

• Suppose Q is represented as a product of ^ — f (f -I- 1) Givens rotations, 
each one creating a unique subdiagonal zero entry in A which is never filled 
in. There are many orders in which these zeros can be created, and possibly 
many choices of row that each Givens rotation may rotate with to zero out 
its desired entry. If the desired zero entry in Aj,i is created by the rotation 
in rows j' and j, j' < j, then associate Qj ^ with the value of the cosine in 
the Givens rotation, since this will be multiplied by Aj_k- 

• Suppose, finally, that we use CAQR to perform the QR decomposition, so 
that Q — Qi ■ ■ ■ QfQ, where each Qg is the result of TSQR on bg columns. 
Consider without loss of generality Qi, which operates on the first bi columns 
of A. We argue that TSQR still produces m — i parameters associated with 
column i as the above methods. Suppose there are P row blocks, each of 
dimension ^-hy-bi. Parallel TSQR initially does QR independently on each 
block, using any of the above methods; we associate multipliers as above with 
the subdiagonal entries in each block. Now consider the reduction tree that 
combines q different 6i-by-6i triangular blocks at any particular node. This 
generates {q — -I- l)/2 parameters that multiply the equal number of 
entries of the q — I triangles being zeroed out, and so can be associated with 
appropriate entries of Q. Following the reduction tree, we see that parallel 
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TSQR produces exactly as many parameters as Householder reduction, and 
that these may be associated one-for-one with all subdiagonal entries of Q{: 
, 1 : &i) and A{:, 1 : bi) as above. Sequential TSQR reduction is analogous. 
We see that we have only tried to capture the dependencies of a fraction of the 
arithmetic operations performed by various QR implementations; this is all we need 
for a lower bound. 

Now we resort to the geometric approach of [27 : Consider a three dimensional 
block of lattice points, indexed by k). Each point on the (z, 0, k) face is associated 
with Ri,k, for 1 < < ^. Each point on the (0,j, fc) face is associated with Aj^k, 
for 1 < fc < ^ and \ < j < ni. Each point on the {i,j,0) face is associated with Qj^i, 
for 1 < z < 1^ and I < j < m. Finally, each interior point k) for 1 < i, A: < f and 
1 < j < TO represents the multiplication Qj.i'Aj^k- The point is that the multiplication 
at {i,j,k) cannot occur unless Qj^i and Aj ^ are together in memory. 

Finally, we need the Loomis- Whitney inequality [M] : Suppose F is a set of lattice 
points in 3D, Vi is projection of V along i onto the (j, k) plane, and similarly for Vj 
and Vk- Let \V\ denote the cardinality of V, i.e. counting lattice points. Then 
I^P < %\ ■ \Vj\ ■ \Vk\. We can now state 

Lemma 5.2. Suppose a processor with local (fast) memory of size W is partici- 
pating in the QR decomposition of an m-by-n matrix, m > n, using an algorithm of 
the sort discussed above. There may or may not be other processors participating (i.e. 
this lemma covers the sequential and parallel cases). Suppose the processor performs 
F multiplications. Then the processor must move the following number of words into 
or out of its memory: 

F 

# of words moved > ~ ^ (5-9) 

using at least the following number of messages: 

F 

# of messages > 3 - 1 (5.10) 



Proof. The proof closely follows that of Lemma 3.1 in [57]. We decompose the 
computation into phases. Phase I begins when the total number of words moved into 
and out of memory is exactly IW. Thus in each phase, except perhaps the last, the 
memory loads and stores exactly W words. 

The number of words Uj^ from different Aj^. that the processor can access in its 
memory during a phase is 2W , since each word was either present at the beginning 
of the phase or read during the phase. Similarly the number of coefficients ng from 
different Qji also satisfies ng < 2W. Similarly, the number nn of locations into 
which intermediate results like Qji ■ Ajf^ can be accumulated or stored is at most 2W. 
Note that these intermediate results could conceivably be stored or accumulated in A 
because of overwriting; this does not affect the upper bound on n^. 

By the Loomis- Whitney inequality, the maximum number of useful multiplica- 
tions that can be done during a phase (i.e. assuming intermediate results are not 
just thrown away) is bounded by y/nj^ ■ uq ■ < \/'8W^. Since the processor does 
F multiplications, the number of full phases required is at least 



F 
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so the total number of words moved is W times larger, i.e. at least 

F 

# number of words moved > , — W . 

~ Vsw 

The number of messages follows by dividing by W, the maximum message size. □ 
The following is our main result for sequential CAQR: 

Corollary 5.3. Consider a single processor computing the QR decomposition 
of an m-by-n matrix with m > n, using an algorithm of the sort discussed above. 
Then the number of words moved between fast and slow memory is at least 

# of words moved > ^ ^^^^\% - W > j^^^ " W (5.11) 
using at least the following number of messages: 

#of„.ess^es> .,^4%" ' -l>y^-l ,5.12) 



Proof. The proof follows easily from Lemma |5.2| by using the lower bound 
F > ^ ^(^ + 1) on the number of multiplications by any algorithm in the 



class discussed above (see Lemma 5.5 in subsection 5.3 for a proof). □ 

The lower bound could be increased by a constant factor by using a specific 

number of multiplications (say mn? — n^/3 using Householder reductions), instead 

of arguing more generally based on the number of parameters needed to represent 

orthogonal matrices. 

Comparing to the performance model in Section 3.2 especially Table 3.3 we see 

that sequential CAQR attains these bounds to within a constant factor. 
The following is our main result for parallel CAQR: 

Corollary 5.4. Consider a parallel computer with P processors and W words 
of memory per processor computing the QR decomposition of an m-by-n matrix with 
m > n, using an algorithm of the sort discussed above. Then the number of words 
sent and received by at least one processor is at least 

# of words moved > — , ^ — - - W> ) - W (5.13) 

^ - P(8M^)V2 - 16p(8iy)i/2 ^ ' 

using at least the following number of messages: 

^-ir(5 + l) 3n2(TO-|) , , 

In particular, when each processor has W — mn/P words of memory and the matrix 
is not too rectangular, n > ^ p™ , then the number of words sent and received by at 
least one processor is at least 



mn 

# of words moved > \j (5.15) 



using at least the following number of messages: 



nP 

# of messages > W —r--. — . (5.16) 

V 2^^771 
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In particular, in the square case m = n, we get that as long as P > 2^^, then the 
number of words sent and received by at least one processor is at least 



n2 



# of words moved > 2ii/2pi/2 

(5.17) 

using at least the following number of messages: 

# of messages > \ — jj . (5.18) 



Proof. The result follows from the previous Corollary, since at least one processor 
has to do 1/P-th of the work. □ 



Comparing to the performance model in Section 3.1 especially Table 3.2 



that parallel CAQR attains these bounds to within a constant factor. 

5.3. Lower Bounds on Flop Counts for QR. This section proves lower 
bounds on arithmetic for any "columnwise" implementation of QR, by which we mean 
one whose operations can be reordered so as to be left looking, i.e. the operations that 
compute columns i of Q and R depend on data only in columns 1 through i of A. The 
mathematical dependencies are such that columns i oi Q and R do only depend on 
columns 1 through i of A, but saying that operations only depend on these columns 
eliminates algorithms like Strassen. (It is known that QR can be done asymptotically 
as fast as any fast matrix multiplication algorithm like Strassen, and stably [15 .) 

This section says where the lower bound on F comes from that is used in the 
proof of Corollary |5.3| above. 

The intuition is as follows. Suppose A — QR is m-by-(j + 1), so that 

Q^-A= (Q(l : m, 1 : j)f ■ A{1 : m, j + 1) = R{1 : j, j + 1) = R . 

where Q only depends on the first j columns of A, and is independent of A. As an 
arbitrary m-by-j orthogonal matrix, a member of the Stiefel manifold of dimension 
™i ~ i(j + l)/2, Q requires mj ~ j{j + l)/2 independent parameters to represent. We 
will argue that no matter how Q is represented, i.e. without appealing to the special 
structure of Givens rotations or Householder transformations, that unless mj — j{j + 
l)/2 multiplications are performed to compute R it cannot be computed correctly, 
because it cannot depend on enough parameters. 

Assuming for a moment that this is true, we get a lower bound on the number of 
multiplications needed for QR on an m-by-n matrix by summing X]j=i ['^J ^ Jij + 

l)/2] = IT ^" 0{mn). The two leading terms are half the multiplication count 

for Householder QR (and one fourth of the total operation count, including additions). 
So the lower bound is rather tight. 

Again assuming this is true, we get a lower bound on the value F in Corollary |5.3| 
by multiplying f • (mf - f (f + l)/2) = - ^^(f + 1) < F. 

Now we prove the main assertion, that mj — j{j + l)/2 multiplications are needed 
to compute the single column R — ■ A, no matter how Q is represented. We model 
the computation as a DAG (directed acyclic graph) of operations with the following 
properties, which we justify as we state them. 

1. There are m input nodes labeled by the m entries of A, ai through am,j+i- 
We call these A-input nodes for short. 
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2. There are at least mj — j{j + l)/2 input nodes labeled by parameters repre- 
senting Q, since this many parameters are needed to represent a member of 
the Stiefel manifold. We call these Q-input nodes for short. 

3. There are two types of computation nodes, addition and multiplication. In 
other words, we assume that we do not do divisions, square roots, etc. Since 
we are only doing matrix multiplication, this is reasonable. We note that any 
divisions or square roots in the overall algorithm may be done in order to 
compute the parameters represented Q. Omitting these from consideration 
only lowers our lower bound (though not by much). 

4. There are no branches in the algorithm. In other words, the way an entry of 
R is computed does not depend on the numerical values. This assumption 
reflects current algorithms, but could in fact be eliminated as explained later. 

5. Since the computation nodes only do multiplication and addition, we may 
view the output of each node as a polynomial in entries of A and parameters 
representing Q. 

6. We further restrict the operations performed so that the output of any node 
must be a homogeneous linear polynomial in the entries of A. In other words, 
we never multiply two quantities depending on entries of A to get a quadratic 
or higher order polynomial, or add a constant or parameter depending on 
Q to an entry of A. This is natural, since the ultimate output is linear and 
homogeneous in A, and any higher degree polynomial terms or constant terms 
would have to be canceled away. No current or foreseeable algorithm (even 
Strassen based) would do this, and numerical stability would likely be lost. 

7. There are j output nodes labeled by the entries of R, rij+i through rjj+i. 
The final requirement means that multiplication nodes are only allowed to multi- 
ply Q-input nodes and homogeneous linear functions of A, including ^-input nodes. 
Addition nodes may add homogeneous linear functions of A (again including A-input 
nodes) , but not add Q-input nodes to homogeneous linear functions of A. We exclude 
the possibility of adding or multiplying Q-input nodes, since the results of these could 
just be represented as additional Q-input nodes. 

Thus we see that the algorithm represented by the DAG just described outputs 
j polynomials that are homogeneous and linear in A. Let M be the total number of 
multiplication nodes in the DAG. We now want to argue that unless M > mj — j{j + 
l)/2, these output polynomials cannot possibly compute the right answer. We will do 
this by arguing that the dimension of a certain algebraic variety they define is both 
bounded above by M, and the dimension must be at least mj — j{j + l)/2 to get the 
right answer. 

Number the output nodes from 1 to j. The output polynomial representing node 
i can be written as X]fe=i Pk,i{Q)<ik,j+i, where Pk.iiQ) is a polynomial in the values of 
the Q-input nodes. According to our rules for DAGs above, only multiplication nodes 
can introduce a dependence on a previously unused input node, so all the Pk,i{Q) 
can only depend on M independent parameters. 

Finally, viewing each output node as a vector of m coefiicient polynomials 
{pi,i{Q), ■■■,Pm,i{Q)), we can view the entire output as a vector of mj coefiicient 
polynomials V{Q) = (pi,i(Q), ■■■,Pm,j{Q)), depending on M independent parameters. 
This vector of length mj needs to represent the set of all m-hy-j orthogonal matrices. 
But the Stiefel manifold of such orthogonal matrices has dimension mj—j{j + l)/2, so 
the surface defined by V has to have at least this dimension, i.e. M > mj — j{j + l)/2. 

As an extension, we could add branches to our algorithm by noting that the 
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output of our algorithm would be piecewise polynomials, on regions whose boundaries 
are themselves defined by varieties in the same homogeneous linear polynomials. We 
can apply the above argument on all the regions with nonempty interiors to argue 
that the same number of multiplications is needed. 
In summary, we have proven 

Lemma 5.5. Suppose we are doing the QR factorization of an m-by-n matrix 
using any "columnwise" algorithm in the sense described above. Then at least mj — 
iU + l)/2 multiplications are required to compute column J + 1 of R, and at least 
^(f + 1) rnultiplications to compute columns ^ + 1 through n of R. 

6. Related work. The central idea in this paper is factoring tall skinny matrices 
using a tree-based Householder QR algorithm. A number of authors previously figured 
out the special case of a binary reduction tree for parallel QR. As far as we know, 
Golub et al. [22 were the first to suggest it, but their formulation requires nlogP 
messages for QR of an m x n matrix on P processors. Pothen and Raghavan [58] were 
the first, as far as we can tell, to implement parallel TSQR using only log P messages. 
Da Cunha et al. 'TT\ independently rediscovered parallel TSQR. 

Other authors have worked out variations of the algorithm we call "sequential 
TSQR" [S H |25l [an gOl E]. They do not use it by itself, but rather as the panel 
factorization step in the QR decomposition of general matrices. The references [6l 
[71 [inj [311 HO] refer to the latter algorithm as "tiled QR," which is the same as our 
sequential CAQR with square blocks. However, they use it in parallel on shared- 
memory platforms, especially single-socket multicore. They do this by exploiting the 
parallelism implicit in the directed acyclic graph of tasks. Often they use dynamic 
task scheduling, which we could use but do not discuss in this paper. Since the 
cost of communication in the single-socket multicore regime is low, these authors are 
less concerned than we are about minimizing latency; thus, they are not concerned 
about the latency bottleneck in the panel factorization, which motivates our parallel 
CAQR algorithm. We also model and analyze communication costs in more detail 
than previous authors did. 

Here are recent examples of related work on sequential CAQR. Gunter and van de 
Geijn develop a parallel out-of-DRAM QR factorization algorithm that uses a flat tree 
for the panel factorizations [25]. Buttari et al. suggest using a QR factorization of this 
type to improve performance of parallel QR on commodity multicore processors [B]. 
Quintana-Orti et al. develop two variations on block QR factorization algorithms, and 
use them with a dynamic task scheduling system to parallelize the QR factorization 
on shared-memory machines [40] . Kurzak and Dongarra use similar algorithms, but 
with static task scheduling, to parallelize the QR factorization on Cell processors [3T] . 

As far as we know, parallel CAQR is novel. Nevertheless, there is a body of work 
on theoretical bounds on exploitable parallelism in QR factorizations. These bounds 
apply to both parallel TSQR and parallel CAQR if one replaces "matrix element" in 
the authors' work with "block" in ours. Cosnard, MuUer, and Robert proved lower 
bounds on the critical path length Opt(m, n) of any parallel QR algorithm of an 
m X n matrix based on Givens rotations [T0|; it is believed that these apply to any 
QR factorization based on Householder or Givens rotations. Leoncini et al. show 
that any QR factorization based on Householder reductions or Givens rotations is 
P-complete \^ . The only known QR factorization algorithm in arithmetic NC (see 
[llj ) is numerically highly unstable jl4j, and no work suggests that a stable arithmetic 
NC algorithm exists. 

Hong and Kung [2H| and Irony, Toledo, and Tiskin [571 proved lower bounds on 
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communication for sequential and parallel matrix multiplication. We are, as far as 
we know, the first to attempt extending these bounds to LU and QR factorization. 
Elmroth and Gustavson proposed a recursive QR factorization (see [TH] [15] ) which 
can also take advantage of memory hierarchies. It is future work to analyze whether 
their algorithm satisfies the same lower bounds on communication as does sequential 
CAQR. It is natural to ask to how much of dense linear algebra one can extend 
the results of this paper, that is finding algorithms that attain communication lower 
bounds. For parallel LU with pivoting, see the technical report by Grigori, Demmel, 
and Xiang [511, and for sequential LU, see [17]. 

Block iterative methods frequently compute the QR factorization of a tall and 
skinny dense matrix. This includes algorithms for solving linear systems Ax = B 
with multiple right-hand sides (such as variants of GMRES, QMR, or CG [iSll^lHTj l. 
as well as block iterative eigensolvers (for a summary of such methods, see [31 [32] )• 
In practice, modified Gram-Schmidt orthogonalization is usually used when a (rea- 
sonably) stable QR factorization is desired. Sometimes unstable methods (such as 
CholeskyQR) are used when performance considerations outweigh stability. Eigen- 
value computation is particularly sensitive to the accuracy of the orthogonalization; 
two recent papers suggest that large-scale eigenvalue applications require a stable QR 
factorization [26. .30) . Many block iterative methods have widely used implementa- 
tions, on which a large community of scientists and engineers depends for their compu- 
tational tasks. Examples include TRLAN (Thick Restart Lanczos), BLZPACK (Block 
Lanczos), Anasazi (various block methods), and PRIMME (block Jacobi-Davidson 
methods) [1S1|35J Hi HI 11 iS] . 

7. Conclusions and Open Problems. We have shown that known bandwidth 
lower bounds for parallel and sequential Q{n^) matrix multiplication imply latency 
lower bounds, shown such bounds apply to both LU and QR algorithms, presented 
some new and some old QR algorithms that attain these bounds, and referred to LU 
algorithms in the literature that attain at least some of these bounds. Whether a 
sequential LU algorithm exists attaining the latency lower bound is an open question. 

There are numerous ways in which one could hope to extend these results. One 
natural conjecture is that the bounds apply to other O(n^) dense linear algebra 
routines, such as eigenvalue problems, and if they do, we would want to find algo- 
rithms that attain them. Another question is finding analogous communication lower 
bounds for asymptotically faster dense linear algebra algorithms like thosed based 
on Strassen's algorithm, or indeed of any matrix multiplication algorithm, based on 
Raz's theorem converting any matrix multiplication algorithm to be "Strassen-like" 
(bilinear noncommutative) 42J. 

But the following question is of more practical importance. Our TSQR and 
CAQR algorithms have been described and analyzed in most detail for simple ma- 
chine models: either sequential with two levels of memory hierarchy (fast and slow), 
or a homogeneous parallel machine, where each processor is itself sequential. Real 
computers are more complicated, with many levels of memory hierarchy and many 
levels of parallelism (multicore, multisocket, multinode, multirack, . . . ) all with dif- 
ferent bandwidths and latencies. So it is natural to ask whether our algorithms and 
optimality proofs can be extended to these more general situations. We hinted at how 
TSQR could be extended to general reduction trees in Section [2] which could in turn 
be chosen depending on the architecture. But we have not discussed CAQR, which 
we do here. 

We again look at the simpler case of matrix multiplication for inspiration. Con- 
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sider the sequential case, with k levels of memory hierarchy instead of 2, where level 1 
is fastest and smallest with Wi words of memory, level 2 is slower and larger with W2 
words of memory, and so on, with level k being slowest and large enough to hold all 
the data. By dividing this hierarchy into two pieces, levels k through i + 1 ("slow") 
and i through 1 ("fast"), we can apply the theory in Section 5.1 to get lower bounds 
on bandwidth and latency for moving data between levels i and i + 1 of memory. So 
our goal expands to finding a matrix multiplication algorithm that attains not just 1 
set of lower bounds, but fc — 1 sets of lower bounds, one for each level of the hierarchy. 

Fortunately, as is well known, the standard approach to tiling matrix multiplica- 
tion achieves all these lower bounds simultaneously, by simply applying it recursively: 
level i + 1 holds submatrices of dimension Q{^/Wi+i), and multiplies them by tiling 
them into submatrices of dimension 6(\/Wi), and so on. 

The analogous observation is true of parallel matrix multiplication on a hierar- 
chical parallel processor where each node in the parallel processor is itself a parallel 
processor (multicore, multisocket, multirack, . . . ). 

We believe that this same recursive hierarchical approach applies to CAQR (and 
indeed much of linear algebra) but there is a catch: Simple recursion does not work, 
because the subtasks are not all simply smaller QR decompositions. Rather they are 
a mixture of tasks, including smaller QR decompositions and operations like matrix 
multiplication. Therefore we still expect that the same hierarchical approach will 
work: if a subtask is matrix multiplication then it will be broken into smaller matrix 
multiplications as described above, and if it is QR decomposition, it will be broken 
into smaller QR decompositions and matrix multiplications. 

There are various obstacles to this simple approach. First, the small QR decom- 
positions generally have structure, e.g., a pair of triangles. To exploit this structure 
fully would complicate the recursive decomposition. (Or we could ignore this struc- 
ture, perhaps only on the smaller subproblems, where the overhead would dominate.) 

Second, it suggests that the data structure with which the matrix is stored should 
be hierarchical as well, with matrices stored as subblocks of subblocks [2D]. This is 
certainly possible, but it differs significantly from the usual data structures to which 
users are accustomed. It also suggests that recent approaches based on decomposing 
dense linear algebra operations into DAGs of subtasks [6j [H |3T1 HOI |39] may need to 
be hierarchical, rather than have a single layer of tasks. A single layer is a good match 
for the single socket multicore architectures that motivate these systems, but may not 
scale well to, e.g., petascale architectures. 

Third, it is not clear whether this approach best accommodates machines that 
mix hierarchies of parallelism and memory. For example, a multicore / multisocket 
/ multirack computer will have also have disk, DRAM and various caches, and it 
remains to be seen whether straightforward recursion will minimize bandwidth and 
latency everywhere that communication takes place within such an architecture. 

Fourth and finally, all our analysis has assumed homogeneous machines, with the 
same flop rate, bandwidth and latency in all components. This assumption can be 
violated in many ways, for example, by asymmetric read and write bandwidths, by 
having different bandwidth and latency between racks, sockets, and cores on a single 
chip, or by having some specialized floating point units like GPUs. 

It is most likely that an adaptive, "autotuning" approach will be needed to deal 
with some of these issues, just as it has been used for the simpler case of a matrix 
multiplication. Addressing all these issues is future work. 
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